Network-based clustering with mixtures of i\ -penalized 
Gaussian graphical models: an empirical investigation 



In many applications, multivariate samples may harbor previously unrecognized hetero- 
geneity at the level of conditional independence or network structure. For example, in can- 
cer biology, disease subtypes may differ with respect to subtype-specific interplay between 
molecular components. Then, both subtype discovery and estimation of subtype-specific net- 
works present important and related challenges. To enable such analyses, we put forward a 
mixture model whose components are sparse Gaussian graphical models. This brings together 
model-based clustering and graphical modeling to permit simultaneous estimation of cluster 
assignments and cluster-specific networks. We carry out estimation within an l\ -penalized 
framework, and investigate several specific penalization regimes. We present empirical re- 
sults on simulated data and provide general recommendations for the formulation and use of 
mixtures of l\ -penalized Gaussian graphical models. 
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1 Introduction 



Clustering of high-dimensional data has been the focus of much statistical research over the past 
decade. The increasing prevalence of high-throughput biological data has been an important mo- 
tivation for such efforts. In molecular biology applications, clustering can be used to either group 
variables together (e.g. to find sets of co-regulated genes; |Eisen et al\ |1998[ |Toh and Horimoto" 



2002 1, group samples together (e.g. to discover disease subtypes characterized by similar gene 



expression profiles; Golub et al. 1999; Alizade h et al. \ [2000 1, or to simultaneously group both 
variables together and samples together ('bi-clustering' methods; Alo n et al\ 1999[ Madeira and 
Oliveira |2004| ). In this work we focus on the second of these approaches; that is, to cluster a 
small-to-moderate number of high-dimensional samples. Numerous clustering algorithms have 
been used in biological applications, notably for gene expression data; see [Datta and Datta| ( |2003| >; 
Thalamuthu et al. ( 2006[ ); Kerr et al. (2008) and de Souto et al. ( 2008[ ) for reviews and compar- 



isons of various methods, including K-means, hierarchical clustering and model-based clustering. 
Model-based clustering (Fraley and Rafteryj 1998 McLachlan et al. \ 2002 1 with Gaussian mixture 
models is a popular approach to clustering that is rooted in an explicit statistical model. 

Another area that has received much attention in recent years is structural inference for graph- 
ical models. In a graphical model, a graph, comprising vertices and linking edges, is used to 
describe probabilistic relationships between variables; structural inference refers to estimation of 
the graph edge structure. In bioinformatics applications, structural inference is important for the 
elucidation of molecular networks, such as gene regulatory or protein signaling networks, from 
biochemical data. Many methods for structural inference have been proposed in the literature, 
including those based on Bayesian networks (Fried man et a/.|[2000t|Husmeier||2003||Segal et al.\ 
20031 |Needham etal\ [2007} |Mukherjee and Speed[ |2"008j |Hill et al.\ |20"T2] ) and Gaussian graphi- 



cal models (Schaferandjjtr^^ 2008| >. They 

are reviewed, along with other approaches, in Lee and Tzou| ( |2009] >; |Hecker et al. (2009) and 
Mar kowetz and Spang] ( |2007| ) . 



In this paper, we develop a model-based clustering approach with components defined by 
graphical models. This allows simultaneous recovery of cluster assignments and estimation of 
cluster-specific graphical model structure. Our work is of particular relevance to questions con- 
cerning undiscovered heterogeneity at the level of network structure. Such questions arise in 
diverse molecular biology applications. The edge structure of biological networks can differ de- 
pending on context, e.g. disease state or other subtype, in ways that may have implications for 
targeted and personalized therapies ( Pe'er and Hacohen] 201 \\ . When such heterogeneity is well- 
understood, samples can be partitioned into suitable subsets prior to network inference ( |Altay| 



et al. 2011 ) (or other supervised network-based approaches (Chuang et al. 2007 1). However, in 



practice, molecular classifications that underpin such stratifications may be uncertain and more- 
over hitherto unknown subtypes may be present. In the latter case, subtype identification is itself 
of independent interest, as in the context of many diseases, including cancer. A crucial observa- 
tion is that if subtypes differ with respect to underlying network structure, clustering and network 
inference become coupled tasks. Clustering methods that do not model cluster-specific network 
structure (including K-means, hierarchical clustering or model-based clustering with diagonal co- 
variance matrices (de Souto et al. 2008 1), may lead to cluster assignments that do not reflect 
the underlying biology and that may also compromise the ability to elucidate network structure. 
Equally, structural inference based on the full, unclustered data can be severely confounded by the 
data heterogeneity. 
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As cluster-specific network models, we use sparse Gaussian graphical models. These are 
multivariate Gaussian models in which an undirected graph is used to represent conditional inde- 
pendence relationships between variables. Inferring the edge set of a Gaussian graphical model 
is equivalent to identifying the location of non-zero entries in the precision matrix (see Section 
2.1 1 below for details). There is a rich literature on precision matrix estimation in the context of 



sparse Gaussian graphical models, with the seminal paper by Dempster (1972) proposing sparse 
estimation by setting entries in the precision matrix to zero. Edwards ( |2000| ) provides a review 
of standard approaches, such as greedy stepwise backward selection, for identifying zero entries 
in the precision matrix. More recent approaches have focused on using regularization, and l\ pe- 
nalization in particular, to achieve sparsity. Meinshausen and Buhlmann ( 2006| ) use Zi-penalized 
regression (lasso; Tibshirani} 1996 ) to perform neighborhood selection for each node in the graph. 
A sparse precision matrix can subsequently be obtained via constrained maximum likelihood esti- 
mation using the inferred sparse graph structure. Maximum penalized likelihood estimators with 
i penalty applied to the precision matrix have been proposed by |Yuan and Lin| (p007|) ; [Fried 



an 



man et a/.] ( |2008[ ); |Rothman et a/. | ([2008) and D'Aspremo nt et a/.| ( |2008] ). Analogous to the lasso, 



where sparse models are encouraged by shrinking some regression coefficients to be exactly zero, 
the l\ penalty on the precision matrix encourages sparsity by estimating some matrix entries as 
exactly zero. Since a sparse precision matrix corresponds to a sparse Gaussian graphical model 
structure, l\ penalized estimation is well-suited for inference of molecular networks, where spar- 
sity is often a valid assumption. Moreover, regularization enables estimation in the challenging 
'large p, small n' regime that is ubiquitous in these settings, but renders standard covariance esti- 
mators inapplicable or ill-behaved. 

Our work adds to the literature in two main ways. First, the penalized mixture-model for- 
mulation we propose extends previous work. Mukherjee and Hill ( 201 1[ > put forward a related 
'network clustering' approach, but this is not rooted in a formal statistical model and estimation 
is carried out using a heuristic, K-means-like algorithm with 'hard' cluster assignments. We show 
empirically that likelihood-based inference via an EM formulation confers benefits over this ap- 
proach. EM algorithms for penalized likelihoods have previously been proposed for finite mixture 
of regression models (Khalili a nd Chen] 2007t[Stadler et al\ 2010[ ) and for penalized model-based 
clustering ( |Pan and Shen| [2007 ; Zhou et al. |2009[ ). The approach in Zhou et aT| ( |2009| ) is similar 
to the one here. However, our l\ penalty takes a more general form, allowing also for dependence 
on mixing proportions at the level of the full likelihood. We show that at smaller sample sizes in 
particular, the l\ penalty we propose offers substantial gains. Furthermore, while we are interested 
in both clustering and cluster-specific network estimation, Zhou et al. ( 2009 ) focus on the use of 
variable selection to improve clustering accuracy. 

Second, we present empirical results investigating the performance of penalisation regimes. A 
penalty parameter controls the extent to which sparsity is encouraged in the precision matrix and 
corresponding graphical model. The choice of method for setting the penalty parameter together 
with the different forms of the i\ penalty itself result in several possible regimes that can be 
difficult to choose between a priori. Our results show that the choice of regime can be influential 
and suggest general recommendations. 

The remainder of this paper is organized as follows. In the next Section we introduce i\- 
penalized estimation for Gaussian graphical models and model-based clustering, and then go on 
to describe the proposed mixture model. In Section 3 we present an empirical comparison, on 
synthetic data, of several regimes for the l\ penalty term and tuning parameter selection. In Section 
4 we close with a discussion of our findings and suggest areas for future work. 
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2 Methods 



2.1 Penalized estimation of Gaussian graphical model structure 

Let X = (Xi, . . . , X p ) T denote a random vector having p-dimensional Gaussian density /(/x, S) 
with mean /i. and covariance matrix X. A Gaussian graphical model uses an undirected graph 
G = (V, E) to describe conditional independence relationships between the random variables 
X\, . . . , X p . The p vertices V of the graph are identified with X\, . . . , X p with edge ^ E 
if and only if Xi is conditionally independent of Xj given all other variables, or equivalently, if 
and only if there is zero partial correlation between Xi and Xj given all other variables (pij = 0). 
Let fl = denote the inverse covariance or precision matrix, and let w,j be entry of Q. 
Then, the relationship between fi and partial correlations is given by pij = — -^M=. Therefore, 

V 14 33 

non-zero entries in fi correspond to edges in the GGM, that is Uij ^ <J=^ G -E. Thus, 

inferring the edge set of a GGM is equivalent to identifying the location of non-zero entries in the 
precision matrix. 

Suppose xi, . . . , x n , with x, = (xu, ■ . . , Xi p ) T is a random sample from /(/x, S). Let x = 
SILi x * denote sample mean and E = ^ X^iLi ( x « — x )( x « — X ) T sample covariance. The 
precision matrix fi may be estimated by maximum likelihood. The log-likelihood function is 
given, up to a constant, by 

z(n) = log |n| - tr(ns) (i) 

where |-| and tr(-) denote matrix determinant and trace respectively. The maximum likelihood 

estimate is given by inverting the sample covariance matrix, Q = S . However for n < p, X! is 
singular and so cannot be used to estimate Q. Even when n > p, Ct can be a poor estimator for 
large p and does not in general yield sparse precision matrices. 

Sparse estimates can be encouraged by placing an l\ penalty on the entries of the precision 
matrix 17. This results in the following penalized log-likelihood: 

i p (n) = iog|n|-tr(ns)-A||n[| 1 (2) 

where \\ft\\i = Ylij i s trie elementwise l\ matrix norm and A is a non-negative tuning 
parameter controlling sparsity of the estimate. The maximum penalized likelihood estimate is ob- 
tained by maximizing (|2]) over symmetric, positive-definite matrices. This is a convex optimization 



problem and several procedures have been proposed to obtain solutions. Yuan and Lin (2007 1 used 



the maxdet algorithm, while |D'Aspremont et al ([2008 ) proposed a more efficient semi-definite 



programming algorithm using interior point optimization. Rothman et al. ( 2008 ) offered a fast ap 



proach employing Cholesky decomposition and the local quadratic approximation, and Friedman 



et al. ([2008 ) proposed the even faster graphical lasso algorithm, based on the coordinate descent 



algorithm for the lasso. We use the graphical lasso algorithm in our investigations and refer the 
interested reader to the references for full details. 



2.2 Gaussian mixture models 

We now suppose xi , . . . , x n is a random sample from a finite Gaussian mixture distribution, 

K 

/(x»; 0) = ^vr fc / fc (xi | /Lt fc ,S fc ) (3) 

k=l 
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where the mixing proportions tt^ satisfy < tt^ < 1 and J2k=i ^k = 1> fk is the p-dimensional 
multivariate Gaussian density with component-specific mean /x fc and covariance and = 
{(7Tfc, fi k , : k = 1, . . . , K} is the set of all unknown parameters. The log-likelihood for the 
sample is given by 



/(G) 




(4) 



Maximizing this log-likelihood is difficult due to its non-convexity. The Expectation-Maximization 
(EM) algorithm (Dempste r et al.\ 1977[ ) can be used to obtain maximum likelihood estimates. 

In model -based clustering ( |Fraley a nd Raftery, 1998 \ McLachlan et al. 2002), each mixture 
component corresponds to a cluster. In the present setting, since each cluster (or component) is 
Gaussian distributed with a cluster-specific (unconstrained) covariance matrix, each cluster repre- 
sents a distinct Gaussian graphical model. 



2.3 Mixture of penalized Gaussian graphical models 

In the Gaussian mixture model with cluster-specific covariance matrices, the number of parameters 
is of order Kp 2 . Estimation is more challenging than for a single precision matrix (or Gaussian 
graphical model) and so, as described above, in settings where number of variables p is moderate- 
to-large in relation to sample size n, overrating and invalid covariance estimates are a concern. We 
employ an i\ penalty on each of the K precision matrices to promote sparsity and ameliorate these 
issues. Such l\ penalties have previously been proposed for clustering with Gaussian graphical 
models ( [Zhou et al.\ [20091 pukherjee and Hiil[|20TT] ). 
We propose the following penalized log-likelihood, 



K 



(5) 



\k=l 



where the penalty term is given by 



A 



PA, 7 (©) = A^TT 



n 



fciii 



(6) 



fc=i 



and 7 is a binary parameter controlling the form of the penalty term. Setting 7 = results in 
the conventional penalty term, as used in Zhou et al. ( 2009| ), with no dependence on the mixing 
proportions irk- Setting 7 = 1 weights the penalty from each cluster by its corresponding mixing 
proportion. While this form of penalty is novel in this setting, an analogous penalty has been 
proposed by Khahli and Chen (2007 1 and Stadler et al. (2010) for l\ -penalized finite mixture of 



regression models. In this work, we empirically compare these two forms of penalty term for 
clustering with, and estimation of, Gaussian graphical models. 



2.4 Maximum penalized likelihood 

As with the unpenalized log-likelihood Q, the penalized likelihood ([5]) can be maximized using 
an EM algorithm, which we now describe. Our algorithm is similar to that of Zhou et al. (2009), 
but they consider only the 7 = regime and also penalize the mean vectors to perform variable 
selection. 
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Let z% be a latent variable satisfying z% = k if observation Xj belongs to cluster k. Then we 
have P(zi = k) = ttu andp(xj | z% = k) = /fe(xj | X^). The penalized log-likelihood for the 
complete data {xj, Z{}™ =1 is 



t=l 



(7) 



In the E-step of the EM, given current estimates of the parameters 8", we compute 
Q(0|0W) = e[/ PiC (0)| {Xi}ti,® {t)] 



n K 



Yl T ik IM 7 ^) + (f k (Xj I H k , E fc ))] - ^PA, 7 (0) 



i=l fc=l 



(8) 



where is the posterior probability of observation x, belonging to cluster k, 



_(f) 



7T 



spK (t) , i | m T ( 



.(*) v(*) 



(9) 



and can be thought of as a 'soft' cluster assignment. 

In the M-step we seek to maximize Q(® \ Q^) with respect to to give new estimates for 
the parameters ©C +1 ). When 7 = the mixture proportions tt^ do not appear in the penalty 
term p\^(@) and so we use the following standard EM update for unpenalized Gaussian mixture 
models: 



7T 



En { 
i=l T i 



(t) 



n 



(10) 



For 7 = 1, since it^ appears in the penalty term, maximization of Q(& \ ©W) with respect to 
7Tfc is non-trivial. We follow Khalili and Chen (2007) and use the standard update (10 1. If the 
standard update improves Q(& | 0^) then this is sufficient to obtain (local) maxima of §5^. An 
improvement is not guaranteed here, but as found in Khalili and Chen (2007), the method works 
well in practice. 

Since the penalty term is independent of we again use the standard update, 



(t+i) 



Y n T W X 
V" T (t) 



(11) 



ik 



The update for S^, or equivalently n&, i s given by 



arg max 



J^r>t } (log|n fc |-tr(n fc S 

Li=l 
[log I 



(*)- 

k ■ 



nX ( 7T 



r (*+i) 



in 



feiii 



arg max log \flk\ ~ tv(ft k S K k ') - X k 



where 



i(*) 



(t) 

ik 



X; 



V n r 
2_/i=l T i 



fclll 



(12) 



(13) 
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is the standard EM update for £ and 

' At+i) 



The optimization problem in |l2| ) is of the form of that in (J2J) with £ replaced by and a scaled 



tuning parameter A^ . Hence we can use the efficient graphical lasso algorithm (Friedman et al. 
2008 ) to perform the optimization. 
From (flOl we have 



[A if 7 = 1 

Hence, when 7 = 0, \$ is a cluster-specific parameter inversely proportional to effective cluster 
sample size, whereas 7 = 1 simply yields A. We note that, even though 7 = 1 gives a cluster- 
specific tuning parameter in the penalized log-likelihood ([5]) while 7 = does not, the converse is 
actually true for the EM updates ( fT2] >. 
Our overall algorithm is as follows: 

1. Initialize 0^°^: Randomly assign each observation Xj into one of K clusters, subject to 
a minimum cluster size n m i n . Set 7r[ = rifc/n where is the number of observations 
assigned to cluster k, set fi^ to sample mean of cluster k, and set fi[ to the maximum 



penalized likelihood estimate for the cluster k precision matrix (using (|2 
2. E-step: Calculate posterior probabilities ('soft' assignments) rif using (|9| 



3. M-step: Calculate updated parameter estimates using ( 10 !-( 15 ). 



Iterate or terminate: Increment t. Repeat steps 2 and 3, or stop if one of the following 
criteria is satisfied: 



• A maximum number of iterations T is reached; t > T. 

• A minimum cluster size n m \ a is reached; Y^=i T ik < n mm for some k. 

• Relative change in penalized log-likelihood is below a threshold e; 

\l p (Qp/l p (Gf-V-l\<e. 

In all experiments below we set T = 100, n m \ a = 4 and e = 10~ 4 . Since the EM algorithm 
may only find local maxima, we perform 25 random restarts and select the one giving the highest 
penalized log-likelihood. 'Hard' cluster assignments are obtained by assigning observations to the 
cluster k with largest probability Tjfe. 



2.5 Tuning parameter selection 

Two approaches are commonly used to set the tuning parameter: cross-validation (CV) and criteria 
such as BIC. In multifold CV, the data samples are partitioned into M data subsets, denoted by 

X( m ) for m = 1, . . . , M. Let 0^ m = {(tt^x, MfcAi ^Aia) ■ k = 1, . . . , K} denote the penalized 
likelihood estimate, obtained using tuning parameter A and by application of the EM algorithm 
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described above to all data save that in subset X^" 1 ) (training data). Performance of this estimate 
is assessed using the predictive log-likelihood; that is, Equation Q applied to subset X*" 1 ) (test 
data). This is repeated M times, allowing each subset to play the role of test data. The CV score 
is 



M / K 



cv(a)=e £ log E4A m) / fe (x, i Air^Sr ) ■ 

m=l i:x,GX( m ) \k=l / 

Then we choose A that maximizes CV(A), where the maximization is performed via a grid search. 
Finally, the selected value is used to learn penalized likelihood estimates from all data. 

In the larger sample case, an alternative to multifold CV is to partition the data into two and 
perform a single train/test iteration, selecting A that maximizes the predictive log-likelihood on 
the test data with penalized parameter estimates from the training data. 

We define the following BIC score for our penalized mixture model: 

BIC(A) = -2/(0 A )+df A log(n) (17) 

where ?(•) is the unpenalized log-likelihood @\ is the penalized likelihood estimate obtained 
with tuning parameter A and df \ is degrees of freedom. Yuan and Lin| (2007 ) proposed an estimate 



of the degrees of freedom for l\ -penalized precision matrix estimation, which generalizes to our 
penalized Gaussian mixture model setting to give 

K 

df A = K(p+l)-l + ^2 #{(j,f) ■■ j < f, {&kx) jf + 0}. (18) 
k=l 

where {^k\)jj' is element (j, j') in f2&A, the penalized likelihood estimate for the cluster k preci- 
sion matrix, using tuning parameter A. Using a grid search, we choose A that minimizes BIC(A). 

BIC is often preferred over CV as it is less computationally intensive. However, we note that, 
even BIC can be computationally expensive when used within clustering since each A value in 
the grid search requires a full application of EM-based clustering. Hence, to reduce computation 
time, we also consider a heuristic, approximate version of these approaches. The heuristic we 
propose relies on the notion that the optimal tuning parameter value does not depend strongly on 
cluster assignments but rather largely on general properties of the data (such as p and n). The 
approach proceeds as follows. First, observations are randomly assigned to clusters, producing 
K pseudo-clusters each with mean size n/K. Second, parameter estimates are obtained for the 
pseudo-clusters, tt^ is taken to be the proportion of samples in pseudo-cluster k and p, k is the 
sample mean of pseudo-cluster k. Then, for varying A, we obtain penalized estimates Ct^x by 
optimizing ([2]) for each pseudo-cluster with the graphical lasso. This can be done efficiently using 
the glassopath algorithm in R (Friedm an et a/.||2008| ) which obtains penalized estimates for 
all considered values of A simultaneously. Third, using these estimates, CV (BIC) scores are 
calculated and maximized (minimized) to select A. These three steps are repeated multiple times 
and A values obtained are averaged to produce a final value. 

3 Simulated data 

In this section we apply the l\ -penalized Gaussian graphical model clustering approach to simu- 
lated data. We consider a number of combinations of l\ penalty term and tuning parameter scheme 
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(as described in Methods above) and assess their performance in carrying out three related tasks. 
First, recovery of correct cluster assignments. Second, estimation of cluster-specific graphical 
model structure (i.e. location of non-zero entries in cluster-specific precision matrices). Third, 
estimation of cluster-specific precision matrices (i.e. estimation of matrix elements, not just loca- 
tions of non-zero entries). We note that this latter task is of less interest here since we are mainly 
concerned with clustering and inference of cluster-specific network structure. 



3.1 Data generation 

In our simulation we considered p-dimensional data consisting of K = 2 clusters, each with 
a known and distinct Gaussian graphical model structure (i.e. sparse precision matrix). Sparse 
precision matrices were created using an approach based on that used by Rothman et al. ( 2008 ) 



and |Cai et al. ( |2011| ). In particular, we created a symmetric p x p matrix B\ with zeros everywhere 



except for p randomly chosen pairs of symmetric, off-diagonal entries, which took value 0.5. A 
second matrix B<i was created from B\ by selecting half of the p non-zero symmetric pairs at 
random and relocating them to new randomly chosen symmetric positions. We then set fi^ = 
5kl, where 5^ is the minimal value such that is positive-definite with condition number less 
than p. Finally, the precision matrices were standardized to have unit diagonals. This resulted 
in cluster-specific Gaussian graphical models each with p edges, half of which were shared by 
both network structures. Data were generated from J\f (0, fi^ 1 ) and J\f (;^pl> ^2 f° r dusters 
1 and 2 respectively, where 1 is the vector of ones. The mean of cluster two is defined such that the 
parameter a sets the Euclidean distance between the cluster means. In the experiments below we 
consider p = 25, 50, 100 and cluster sample sizes of = 15, 25, 50, 100, 200. We set a = 3.5, 
resulting in individual component- wise means for cluster two of 0.70, 0.50 and 0.35 for p = 25, 50 
and 100 respectively. This reflects the challenging scenario where clusters do not have substantial 
differences in mean values, but display heterogeneity in network structure while also sharing some 
network structure across clusters. 



3.2 Methods and regimes 

We assessed ability to recover correct cluster assignments from 50 simulated datasets, under the 
following four regimes for the penalty term p\ i7 (0) in (|6j): 7 = or 1 and A set by BIC or a 
train/test scheme, maximizing the predictive log-likelihood on an independent test dataset with 
cluster sample sizes matching the training dataset. These regimes are described fully above and 
summarized in Table [T] We also compared with (i) K-means; (ii) standard non-penalized full- 
covariance Gaussian mixture models estimated using EM; and (hi) 'network clustering', an l\- 



penalized Gaussian graphical model clustering approach proposed by Mukherjee and Hill (2011 
This is similar to the approach employed here but uses a heuristic, K-means-like algorithm with 
'hard' cluster assignments rather than a mixture-model formulation with EM. For (i) we used the 
kmeans function in the MATLAB statistics toolbox with K=2 and 1000 random initializations 
and for (hi) we used MATLAB function network.clustering (Mukherjee and Hill] 2011 



For (ii) and (iii) we used the same stopping criteria as described in Methods above (namely, T = 
100, n m i n = 4 and e = 10~ 4 ) and again carried out 25 random restarts. Method (iii) requires 
maximization of K penalized log-likelihoods of form ([2]) above (one for each cluster). For setting 
penalty parameters for this method, we considered either a single tuning parameter A shared across 
both clusters and set by BIC or train/test, or cluster-specific tuning parameters A&, set analytically 
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Table 1: Clustering methods and regimes investigated, with corresponding abbreviations. 



Method 



Penalty term 



Tuning parameter 
selection 



Abbrev. 



mixture of t\ -penalized 
Gaussian graphical 
models with EM 
('soft' assignments) 

l\ -penalized Gaussian 
graphical models 
('hard' assignments) 

Mukherjee and Hill 2011 



p\ jl (@) with 7 = 0: 

^J2k=i ll^fclli 

PA >7 (©) with 7 = 1: 



Train/test 



BIC 

Train/test 



^EfcLi^fc ll^felli 

A||n fc || lt A; = l,...,Jf 

\ k \\n k \\ 1 ,k = i,...,K 



BIC 

Train/test 



BIC 

Analytic 



TO 



BO 
Tl 



Bl 

Th 



Bh 

Ah 



K-means 



n/a 



n/a 



KM 



non-penalized Gaussian 
mixture model with EM 



n/a 



n/a 



NP 



1 following Banerjee et al. ' 2008 i (see text for details) 



before each call to the penalized estimator using the equation proposed by Banerjee et al. (2008 



Equation 3). All computations were carried out in MATLAB R2010a, making an external call to 
the R package glasso (Friedman et al.\ 2008). Table [T]gives abbreviations for all methods and 
regimes investigated, which are used below and in figures. 



3.3 Tuning parameter selection 

Table [2] shows average tuning parameter values selected by each regime. A grid search was used 
over values between 0.05 and 1.5, with increments of 0.05. Since, for 7 = 0, the EM update 
tuning parameters A& in (12) differ from A, we also show A^ for these regimes. Using BIC to 
set the tuning parameter results in higher values than with train/test and, as expected, A values 
increase with p and decrease with rip,. 



3.4 Cluster assignment 

Figure[T]shows Rand indices (with respect to the true cluster assignments) obtained from clustering 
the simulated data. The Rand index is a measure of similarity between cluster assignments, taking 
values between and 1 (0 indicates complete disagreement and 1 complete agreement). Box plots 
are shown over 50 simulated datasets for each (p, n^) regime. The ^i-penalized mixture model 
regimes with 7 = 1 in the penalty term (Tl/B 1) consistently provide the best clustering results. At 
the largest sample sizes both train/test (Tl) and BIC (Bl) offer good clustering performance, with 
high Rand indices reported. However, for smaller sample sizes, train/test outperforms BIC at the 
lowest data dimensionality (p = 25), while the converse is true at higher dimensions (p = 50, 100). 
The non-mixture l\ -penalized method (Th/Bh) also performs well, but the corresponding mixture 
model approaches with 7 = 1 (Tl/B 1) are, for the most part, more effective at smaller sample sizes 
(see e.g. n k = 50, p = 50, 100). This difference in performance is likely due to a combination of 
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differences in tuning parameter (Supplementary Table SI) and less accurate parameter estimation 
for the non-mixture approaches because they do not take uncertainty of assignment into account. 
Interestingly, the mixture model with conventional penalty term (7 = 0; TO/BO) shows poor 
performance relative to 7 = 1 except at larger sample sizes, with consistently poor clustering 
accuracy for < p. Similar performance is observed for the non-mixture method with analytic 
tuning parameter selection (Ah). The poor performance of these three regimes appears to be 
related to the fact that they all use cluster-specific tuning parameters (Afc for Ah and within 
EM for TO/BO), resulting in considerable differences in cluster-level penalties (see Supplementary 
Table SI). We comment further on this finding in Discussion below. Due to its inability to capture 
the cluster-specific covariance (network) structure, K-means does not perform well, even at the 
largest sample size. Conventional non-penalized mixture models did not yield valid covariance 
estimates for sample sizes < p, and for > p we only observe gains relative to K-means in 
the large sample p = 25, = 200 case. 



3.5 Estimation of graphical model structure 

Figure [2] shows results for estimation of cluster-specific network structures for the methods and 
regimes in Table[T] For K-means, clustering is followed by an application, to each inferred cluster, 
of l\ -penalized precision matrix estimation (see ([2])) with tuning parameter set by either BIC or 
train/test. 

Ability to reconstruct cluster-specific networks is assessed by calculating the true positive rate 
(TPR), false positive rate (FPR) and Matthews Correlation Coefficient (MCC), 

tpr =tTTfn' fpr= fWTtn <19) 

MCC = TP.TN-FP.FN 

y/{TP + FP)(TP + FN)(TN + FP)(TN + FN) 

where TP, TN, FP and FN denote the number of true positives, true negatives, false positives 
and false negatives (with respect to edges) respectively. MCC summarizes these four quantities 
into one score and is regarded as a balanced measure; it takes values between -1 and 1, with 



higher values indicating better performance (see e.g. Baldi et al. ( 2000[ ) for further details). Since 
the convergence threshold in the glasso algorithm is 10~ 4 , we take entries Cjij in estimated 
precision matrices to be non-zero if \ > 10~ 3 . Since cluster assignments can only be identified 
up to permutation, in all cases labels were permuted to maximize agreement with true cluster 
assignments before calculating these quantities. 

Figure [2] shows MCC plotted against per-cluster sample size rip. and Supplementary Figure SI 
shows corresponding plots for TPR and FPR. Due to selection of smaller tuning parameter values, 
BIC discovers fewer non-zeroes in the precision matrices than train/test, resulting in both fewer 
true positives and false positives. Under MCC, BIC, with either the 7 = 1 mixture model (Bl) or 
the non-mixture approach (Bh), leads to the best network reconstruction (except at small sample 
sizes with p = 25) and outperforms all other regimes at larger sample sizes. 

In general, train/test is not competitive relative to BIC; at larger sample sizes the best train/test 
regimes (Tl/Th) are only comparable with the worst performing BIC regimes (BO/KM+B). We 
note that the non-penalized mixture approach (NP), with sample size sufficiently large to provide 
valid covariance estimates, does not yield sparse precision matrices (MCC scores are approxi- 
mately zero). 
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(c) p=100 



Figure 2: Simulated data; estimation of graphical model structure. True Positive Rate (TPR), False Positive Rate (FPR) 
and Matthews Correlation Coefficient (MCC) are shown as a function of per-cluster sample size rih for the methods and 
regimes in Table[T|and at data dimensions p — 25, 50, 100. MCC is a balanced measure for classification performance, 
taking values between - 1 and 1 with higher values indicating better agreement between true and inferred networks (see 
text for details). K-means clustering was followed by l\ -penalized estimation of Gaussian graphical model structure 
with penalty parameter set by train/test ('KM+T) or BIC ('KM+B'). (Mean values shown over 50 simulated datasets 
for each [p, rife) regime, error bars show standard errors; non-penalized approach (NP) is only shown for MCC and 
could not be used for rik < p due to small sample sizes resulting in invalid covariance estimates.) 
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3.6 Precision matrix estimation 



We also assessed ability to accurately estimate underlying cluster-specific precision matrices (i.e. 
values of the matrix elements rather than only locations of non-zeros). Accuracy is assessed using 
the elementwise l\ norm, Ylk=i &k ~ > w i tn inferred clusters matched to true clusters as 
described above. Results are shown in Table |3] In contrast to clustering and Gaussian graphical 
model estimation, where BIC regimes Bl/Bh mainly provide the best performance, the train/test 
methods Tl/Th are mostly similar or better than Bl/Bh for precision matrix estimation (the ex- 
ception being small n^, higher p settings). Due to poor clustering performance, the mixture model 
approach with 7 = does not perform well unless is sufficiently large. Neither K-means clus- 
tering (followed by l\ -penalized precision matrix estimation), the penalized non-mixture approach 
with analytic tuning parameter selection (Ah), nor the non-penalized approach (NP) perform well, 
even at the largest sample size. 



3.7 Approximate tuning parameter selection 

We applied the heuristic method for setting the tuning parameter, described in Methods above, to 
the overall best-performing mixture model approach (regime B 1; 7 = 1, BIC). Figure[3]compares 
average A values obtained using the heuristic method with those resulting from the full approach; 
we also show average Rand indices and computational timings. The A values obtained via the 
heuristic scheme are well-behaved in the sense that they increase with p and decrease for larger 
nfc. We observe some bias relative to the full approach as the values obtained from the heuristic 
method are consistently higher. However, Rand indices remain in reasonable agreement and the 
heuristic offers some substantial computational gains; e.g. for p = 25 we see reduction of about 
90% in computation time. This suggests that the heuristic approach could be useful for fast, 
exploratory analyses. 



4 Discussion 

We presented a study of model-based clustering with mixtures of i\ -penalized Gaussian graphical 
models. Penalization has emerged as a key approach in high-dimensional statistics. However, 
choice of penalization set-up and the setting of tuning parameters can be non-trivial. We found that 
performance is dependent on choice of penalty term and method for setting the tuning parameter. 
Along with the standard i\ penalty (7 = in (|6]>) we considered an alternative penalty term, 
following recent work in penalized finite mixture of regression models ( Khali li and Chen| [2007 



Stadler et al. 2010), that is dependent on the mixing proportions 717. (7 = 1 in (|6])). 

From our simulation study and application to breast cancer data, we draw some broad conclu- 
sions and recommendations, as follows. The combination of the 7 = 1 penalty term (incorporating 
mixing proportions), together with the BIC criterion for selecting the tuning parameter (regime 
Bl), appears to provide the most accurate clustering and estimation of graphical model structure. 
The only exception is in settings where both dimensionality and sample size are small; here, the 
smaller tuning parameter values selected by train/test (or cross-validation) provide superior results 
(regimes T1/CV1). For estimation of the precision matrix itself (as opposed to estimation of spar- 
sity structure only), we again recommend the penalty term with 7 = 1 and find that the less sparse 
estimates provided by train/test (or cross-validation) provide slight gains over BIC, except where 
dimensionality is large relative to sample size. 



15 



Table 3: Simulated data; precision matrix estimation. Elementwise t\ matrix norm shown for the methods and regimes 
in Table [T] at varying data dimensions p and per-cluster sample sizes (smaller values indicate better agreement be- 
tween true and inferred precision matrices). K-means clustering was followed by t\ -penalized estimation of Gaussian 
graphical model structure with penalty parameter set by train/test ('KM+T') or BIC ('KM+B'). For each (p, nu) com- 
bination, the regime with lowest mean norm is highlighted in bold. (Mean matrix norm over 50 simulated datasets per 
(p, rtk) regime, standard deviations in parentheses; non-penalized, full covariance Gaussian mixture model (NP) could 
not be used for small sample sizes nk < p due to invalid covariance estimates.) 



p 


n k 


TO 


B0 


Tl 


Bl 


Th 


Bh 


Ah 


KM+T 


KM+B 


NP 




15 


64.00 


65.73 


56.88 


58.92 


60.77 


59.71 


69.21 


62.05 


158.01 






(4.00) 


(2.52) 


(4.87) 


(3.12) 


(5.59) 


(3.07) 


(2.58) 


(4.35) 


(136.36) 






25 


64.41 


66.39 


52.36 


56.64 


57.79 


57.35 


67.31 


60.02 


62.92 






(4.38) 


(2.53) 


(4.47) 


(3.61) 


(6.48) 


(4.01) 


(3.63) 


(4.35) 


(31.35) 




25 


50 


47.56 


57.83 


47.76 


49.89 


44.06 


47.51 


64.50 


60.15 


56.20 


814.79 


(9.87) 


(11.67) 


(6.83) 


(4.01) 


(2.45) 


(3.36) 


(3.28) 


(6.37) 


(4.22) 


(282.61) 




100 


35.53 


37.73 


35.43 


38.22 


35.15 


38.04 


61.41 


62.78 


54.86 


275.75 




(1.86) 


(7.06) 


(1.88) 


(2.42) 


(1-95) 


(2.35) 


(3.50) 


(8.74) 


(5.87) 


(29.26) 




200 


26.56 


27.81 


27.86 


29.73 


27.57 


29.64 


56.68 


64.78 


53.09 


112.64 




(1.48) 


(3.22) 


(2.56) 


(1-59) 


(2.52) 


(1.66) 


(3.18) 


(11.49) 


(7.38) 


(21.30) 




15 


126.47 


129.52 


119.20 


117.25 


123.93 


119.21 


137.91 


124.41 


450.23 






(5.57) 


(5.09) 


(6.54) 


(5.25) 


(9.17) 


(5.60) 


(5.81) 


(6.25) 


(377.88) 






25 


129.08 


129.75 


118.52 


113.23 


123.89 


114.08 


136.01 


124.50 


159.21 






(4.89) 


(4.46) 


(11.53) 


(5.46) 


(11.13) 


(5.54) 


(5.25) 


(6.53) 


(172.28) 




50 


50 


128.38 


129.29 


94.08 


103.06 


102.96 


102.88 


134.11 


123.51 


112.27 




(7.68) 


(3.57) 


(4.02) 


(6.16) 


(10.57) 


(5.00) 


(5.76) 


(6.45) 


(4.56) 






100 


120.24 


125.51 


78.95 


82.77 


77.24 


82.90 


127.33 


121.14 


104.06 


1366.07 




(14.90) 


(7.66) 


(3.26) 


(3.12) 


(3.56) 


(3.51) 


(7.37) 


(7.77) 


(4.46) 


(96.11) 




200 


63.69 


63.79 


63.67 


67.31 


64.10 


67.15 


119.91 


122.92 


98.90 


587.84 




(2.70) 


(2.95) 


(2.69) 


(2.44) 


(2.73) 


(2.46) 


(8.74) 


(10.50) 


(5.39) 


(23.48) 



15 


250.78 


256.29 


247.53 


236.25 


249.46 


237.79 


277.45 


262.41 


799.86 




(8.60) 


(8.09) 


(13.87) 


(10.02) 


(14.67) 


(9.50) 


(9.77) 


(10.77) 


(794.03) 




25 


256.42 


256.42 


249.49 


229.69 


253.52 


231.93 


277.61 


259.05 


427.52 




(7.57) 


(7.66) 


(14.25) 


(9.76) 


(12.55) 


(8.59) 


(9.50) 


(11.26) 


(534.15) 




100 50 


254.02 


249.76 


203.44 


205.45 


248.40 


208.04 


267.44 


244.39 


216.46 




(8.77) 


(6.96) 


(16.47) 


(6.45) 


(23.35) 


(7.93) 


(10.53) 


(10.51) 


(9.12) 




100 


269.91 


246.16 


172.36 


176.36 


167.65 


176.92 


263.57 


238.25 


203.45 




(9.40) 


(8.07) 


(6.88) 


(6.08) 


(5.26) 


(6.43) 


(12.30) 


(12.31) 


(9.27) 




200 


171.59 


239.45 


158.82 


144.53 


134.00 


144.68 


252.23 


235.31 


183.98 


3468.41 


(11.72) 


(6.33) 


(17.70) 


(4.69) 


(3.76) 


(4.31) 


(15.93) 


(19.05) 


(8.16) 


(170.53) 
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Figure 3: Simulated data; heuristic approach for tuning parameter selection, (a) Boxplots over the tuning pa- 
rameters selected by the heuristic method (see text for details) under regime Bl (mixture model with 7 = 1 and BIC) 
are shown (blue boxes), together with the corresponding values obtained with the full, non-approximate approach (red 
boxes), (b) Resulting Rand indices and (c) computational time required to set the parameter are also shown. (All results 
are over 50 simulated datasets for each (p, n^) regime). 
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The deleterious effect of the standard l\ penalty term (7 = 0), at all but the largest sample 
sizes, is intriguing. As described above, this is due to the fact that the standard penalty term leads 
to cluster-specific penalties in the EM update for the precision matrices. (Indeed, we observed 
similar results when setting cluster-specific penalties analytically in a non-mixture model setting). 
These cluster-specific penalties are inversely proportional to the mixing proportions irk '■ in itself 
this behavior seems intuitively appealing since clusters with small effective sample sizes are then 
more heavily regularized. However, we observed that a substantially higher penalty is applied to 
one cluster over the other, indicating that samples were mostly being assigned to the same cluster. 
This is likely due to the 'unpopular' cluster having a poor precision matrix estimate due to a 
large penalty. We note that this behavior is not due to (lack of) EM convergence; the penalized 
likelihood scores from these incorrect clusterings were higher than those obtained using the true 
cluster labels. 



The related non-mixture model approach proposed by Mukherjee and Hill (2011 1 also per- 
formed well in our studies, but clustering results (both from simulated and real data) indicate that 
a mixture model with EM (and 7 = 1 in the penalty term) offers more robust results. 

Although the approaches we recommend performed well in the examples we considered, sen- 
sitivity to penalty formulation and the setting of tuning parameters remain a concern for penalized 
mixtures. Further work will be needed to better understand how such approaches behave in other 
settings and in higher dimensions. Cluster-specific scaling could also pose difficulties for penal- 



ization, as discussed recently in the context of hidden Markov models in Stadler and Mukherjee 



( |2012| ), who propose penalisation using the inverse correlation matrix as a potential solution. The 
approach proposed here could be adapted to use inverse correlation in place of inverse covariance. 

We note that while for simplicity and tractability we focused on the K = 2 clusters case, the 
methods we discuss are immediately applicable to the general K-cluster case. Moreover, since the 
approach we propose is model-based, established approaches for model selection in clustering, 
including information criteria, train/test and cross-validation, can be readily employed to select or 
explore K. 

Our results demonstrate the necessity of some form of regularization to enable the use of 
Gaussian graphical models for clustering in settings of moderate-to-high dimensionality; indeed, 
we see clear benefits of penalization already in the p = 25 case. The ^i-penalty is an attractive 
choice since it encourages sparsity at the level of graphical model structure, and estimation with the 
graphical lasso algorithm (Friedman et al. 2008 ) is particularly efficient, which is important in the 
clustering setting, where multiple iterations are required. Alternatives include shrinkage estimators 
(Schafer and S trimmer] |2005[ ) and Bayesian approaches (Dobra et al. 2004[ [Jones et al. 2005 1. 
However, it has been shown that the l\ -penalized precision matrix estimator (|2]) is biased (Lam 



and Fan 2009 ). Alternative penalties have been proposed in a regression setting to ameliorate this 



issue; the non-concave SCAD penalty ( Fan and Li] 2001 1 and adaptive i\ penalty (Zou 2006), 
and have recently been applied to sparse precision matrix estimation ( Fan et al\ 2009 1. These 
penalties are generally computationally more intensive, but it remains an open question whether 
they improve clustering accuracy relative to the i\ penalty considered here. 

Graphical models based on direct acyclic graphs (DAGs) are frequently used for network in- 
ference, especially in biological settings where directionality may be meaningful (for example, 



Friedma n et al.\ 2000; Husmeier 2003| [Perrin et al. \ 2003 [ [Mu kherj ee and Speed] 2008 ; |Elhs and| 



Wong , 2008 ; Hill et al.\\20\2\ . A natural extension to the ideas discussed here would be to develop 
a clustering approach based on DAGs rather than undirected models. 

There are several recent and attractive extensions to graphical Gaussian model estimation that 
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could be exploited to improve and extend the methods we discuss. For example, the time-varying 
Gaussian graphical model approach of Zhou et al. (2010 ) could be employed, or prior knowledge 
of network structure could be taken into account ( |Anjum et al. 2009| ); such information is abun- 
dantly available in biological settings. The joint estimation method for Gaussian graphical models 
proposed by Guo et al. (2011) explicitly models partial agreement between network structures 
corresponding to a priori known clusters. Such partial agreement could be incorporated in the 
current setting where clusters are not known a priori. 
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